# load the package
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.3.5 ✔ purrr 0.3.4
## ✔ tibble 3.1.6 ✔ dplyr 1.0.7
## ✔ tidyr 1.1.3 ✔ stringr 1.4.0
## ✔ readr 1.4.0 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
1. Run ggplot(data = mpg). What did you see?
data("mpg")
ggplot(data = mpg)
It creates an empty graph: a blank canvas with no layers, because no geom has been added yet.
2. How many rows are in mtcars? How many columns?
data("mtcars")
dim(mtcars)
## [1] 32 11
There are 32 rows and 11 columns in mtcars.
3. What does the drv variable describe? Read the help ?mpg to find out.
?mpg
drv is the type of drive train, where f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive (4wd).
4. Make a scatterplot of hwy versus cyl.
ggplot(data = mpg) +
geom_point(mapping = aes(x = as.factor(cyl), y = hwy, color = as.factor(cyl))) +
ggtitle("Scatterplot of Highway Miles per Gallon versus Number of Cylinders") +
xlab("Number of Cylinders") +
ylab("Highway Miles per Gallon") +
labs(color = "Number of Cylinders")
5. What happens if you make a scatterplot of class versus drv? Why is the plot not useful?
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = class, color = drv)) +
ggtitle("Scatterplot of the Type of car versus Type of Drive Train") +
xlab("Type of Drive Train") +
ylab("Type of car ")
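Only one point appears for each combination of drive train and car type, because both variables are categorical and all observations with the same combination are plotted on top of each other. The plot therefore shows which combinations exist but not how many cars fall into each, so it is not useful for analysis.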
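One way to see how much this overplotting hides is to count the observations behind each point. The sketch below assumes dplyr is available (it is attached by tidyverse); the column name n_cars and the object names are made up for illustration.

```r
library(ggplot2)
library(dplyr)

# Number of cars behind each (drv, class) point in the scatterplot.
overlap <- count(mpg, drv, class, name = "n_cars")
overlap

# geom_count() maps that count to point size, so the overlap becomes visible.
p_count <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = drv, y = class))
p_count
```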
1. What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
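Inside aes(), "blue" is not read as a color name. It is treated as a variable with the single value "blue", and that value is mapped to the first color of the default discrete palette, which is red. To make the points blue, color = "blue" must be set outside aes(), as an argument of geom_point().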
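A minimal sketch of the corrected call, with the color set as a fixed property of the layer rather than as a mapping (the object name p_blue is only for illustration):

```r
library(ggplot2)

# color is set outside aes(), so it is a fixed property of the layer
# rather than a mapping from a data column.
p_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
p_blue
```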
2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
?mpg
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
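As the output shows, manufacturer, model, trans, drv, fl, and class are categorical (type chr). displ, cty, and hwy are continuous. year and cyl are stored as integers but take only a few distinct values, so they are better thought of as discrete. When you run mpg, the tibble printout shows each column's type in angle brackets under its name (<chr>, <dbl>, <int>).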
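As a rough cross-check of the column types, the following sketch collects the first class of every column; the name col_types is hypothetical.

```r
library(ggplot2)

# First class of each column: "character" columns are categorical here,
# "numeric" and "integer" columns hold numbers.
col_types <- vapply(mpg, function(col) class(col)[1], character(1))
col_types

# Printing the tibble shows the same information as type abbreviations
# under the column names.
mpg
```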
3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cyl))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = cyl))
# ggplot(data = mpg) +
# geom_point(mapping = aes(x = displ, y = hwy, shape = cyl))
Here cyl, the number of cylinders, is stored as an integer, so ggplot2 treats it as continuous. For a continuous variable, color follows a gradient and size grows with the value, whereas a categorical variable gets one distinct color or size per level. A continuous variable cannot be mapped to shape: the commented-out code above raises an error.
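To make the contrast concrete, the sketch below puts a continuous mapping next to a discrete one. Converting cyl with factor() is an illustrative choice, not part of the original answer.

```r
library(ggplot2)

# Continuous mapping: cyl is an integer column, so color uses a gradient.
p_cont <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cyl))

# Discrete mapping: factor(cyl) gets one distinct color per level,
# and a discrete variable such as drv can be mapped to shape.
p_disc <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy,
                           color = factor(cyl), shape = drv))

p_cont
p_disc
```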
4. What happens if you map the same variable to multiple aesthetics?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = cyl, color = cyl))
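Both aesthetics are applied at once: point size and point color each encode cyl. The mapping is valid but redundant, since both aesthetics convey the same information.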
5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
?geom_point
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, stroke = cyl))
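The stroke aesthetic sets the width of a point's border. According to ?geom_point, it is meant for shapes that have a border, such as shape 21; shapes 21 to 25 have both a border colour and a fill.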
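The effect is easier to see with a fixed stroke on a bordered shape than with stroke mapped to cyl. The values below (size 3, stroke 2) are arbitrary choices for illustration.

```r
library(ggplot2)

# Shape 21 is a circle with a separate border (colour) and interior (fill);
# stroke sets the width of the border.
p_stroke <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    shape = 21, colour = "black", fill = "white", size = 3, stroke = 2
  )
p_stroke
```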
6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))
The expression displ < 5 is evaluated for every row, which gives a logical variable with the values TRUE and FALSE. ggplot2 then assigns one color to each of the two values.
1. What happens if you facet on a continuous variable?
ggplot(data = mpg) +
geom_point(mapping = aes(x = cty, y = hwy)) +
facet_wrap(~ displ)
facet_wrap() treats the continuous variable as categorical and draws one panel for each distinct value of displ, each showing hwy against cty. With many distinct values this produces many sparse panels.
2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
The empty cells in facet_grid(drv ~ cyl) correspond to combinations of drv and cyl that have no observations in mpg. In the second plot, those same combinations are the positions where no point is drawn, while every combination that has a point also has a non-empty panel in the first plot.
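The combinations that do occur can also be listed directly. This sketch assumes dplyr is attached; the name combos is hypothetical.

```r
library(ggplot2)
library(dplyr)

# Each row is a (drv, cyl) combination that occurs in mpg; combinations
# absent from this table are the empty cells of facet_grid(drv ~ cyl).
combos <- count(mpg, drv, cyl)
combos
```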
3. What plots does the following code make? What does . do?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
In the first plot, drv ~ . facets by drv in the rows and does not facet in the columns, because . fills the column position of the formula. In the second plot, . ~ cyl facets by cyl in the columns and does not facet in the rows, because . fills the row position.
4. Take the first faceted plot in this section:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
Compared with the colour aesthetic, faceting splits the plot into subplots that each display one subset of the data, which makes it easier to focus on a single category. The colour aesthetic becomes confusing as the number of categories grows, because the colors are hard to tell apart. The disadvantage of facets is that it is harder to compare categories across panels and to see how each one sits within the whole dataset.
5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
?facet_wrap
In facet_wrap(), nrow and ncol set the number of rows and columns of the panel layout. Other arguments that control the layout of the panels include as.table and dir (see ?facet_wrap). facet_grid() has no nrow or ncol arguments because its numbers of rows and columns are fixed by the number of unique values of the row and column variables.
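A short sketch of these layout arguments; ncol = 2 and dir = "v" are arbitrary choices for illustration.

```r
library(ggplot2)

# ncol fixes the number of columns; dir = "v" fills the panels column by
# column instead of row by row.
p_wrap <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 2, dir = "v")
p_wrap
```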
6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
If we put the variable with more levels in the rows, each panel becomes very short, the y-axis is compressed, and it is harder to read the values of the points, as the plot below shows.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(class ~ cyl)
# class has more categorical levels, y axis shrank
# and we neither got the full name of each level nor
# accurate values
1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
line chart: geom_line(); boxplot: geom_boxplot(); histogram: geom_histogram(); area chart: geom_area().
2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
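This code draws two geoms, a scatterplot and smooth lines, which share the global mappings: hwy versus displ, colored by drv. Because color = drv also groups the data, there is one smooth line per drive train, drawn without confidence bands since se = FALSE.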
3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?
show.legend = FALSE hides the legend for that layer; if you remove it, the legend appears again. It was probably used earlier in the chapter because three plots were displayed in a row, and a legend to the right of one plot would have caused layout problems.
4. What does the se argument to geom_smooth() do?
se stands for standard error. The argument controls whether the confidence interval is drawn around the smooth line; it is TRUE by default.
5. Will these two graphs look different? Why/why not?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The two graphs look the same. The first specifies the data and mappings once, globally in ggplot(), and both geoms inherit them; the second repeats the same data and mappings in each geom.
6. Recreate the R code necessary to generate the following graphs.
# Plot 1
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Plot 2
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Plot 3
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Plot 4
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
geom_smooth(mapping = aes(x = displ, y = hwy), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Plot 5
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Plot 6
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(color = "white", size = 3) +
geom_point(mapping = aes(color = drv))
1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
The default geom associated with stat_summary() is geom_pointrange(). The plot can be rewritten as follows:
tab26 <- diamonds %>%
group_by(cut) %>%
summarise(min = min(depth), max = max(depth), median = median(depth))
ggplot(data = tab26) +
geom_pointrange(mapping = aes(x = cut, ymin = min, ymax = max, y = median))
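As an alternative to summarising the data by hand, geom_pointrange() can be asked to use the summary stat itself. This is a sketch that assumes ggplot2 3.3.0 or later, where the fun, fun.min and fun.max arguments are available.

```r
library(ggplot2)

# stat = "summary" lets ggplot2 compute the summaries per cut:
# the median is the point, the min and max are the ends of the range.
p_summary <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )
p_summary
```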
2. What does geom_col() do? How is it different to geom_bar()?
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
ggplot(data = diamonds) +
geom_col(mapping = aes(x = cut, y = depth))
Like geom_bar(), geom_col() draws bar charts. The main difference is the default stat: geom_col() uses stat_identity(), so the heights of the bars represent values already in the data and a y aesthetic is required. geom_bar() uses stat_count(), which makes the height of each bar proportional to the number of cases in each group.
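The relationship can be sketched by computing the counts first and passing them to geom_col(), which should reproduce the geom_bar() chart. The name cut_counts is hypothetical, and dplyr is assumed to be attached.

```r
library(ggplot2)
library(dplyr)

# Precompute the counts that geom_bar() would obtain from stat_count().
cut_counts <- count(diamonds, cut)

# geom_col() uses the y values exactly as given (stat_identity).
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))
p_col
```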
3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
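The ggplot2 reference documentation lists them. Examples are geom_bar() and stat_count(), geom_histogram() and stat_bin(), geom_boxplot() and stat_boxplot(), and geom_smooth() and stat_smooth(). In each pair, the geom uses the stat as its default stat, and the stat uses a matching default geom.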
4. What variables does stat_smooth() compute? What parameters control its behavior?
stat_smooth() computes the following variables:
y or x – predicted value
ymin or xmin – lower pointwise confidence interval around the mean
ymax or xmax – upper pointwise confidence interval around the mean
se – standard error
Parameters which control its behavior:
method – Smoothing method (function) that we want to use, e.g. “lm”, “glm”, “gam”, “loess”…
formula – formula to use in smoothing function, e.g. “y ~ x”, “y ~ poly(x, 2)”, “y ~ log(x)”…
se – whether to display the confidence interval
na.rm – remove missing values with/without a warning
geom, stat – override the default connection between geom_smooth() and stat_smooth()
n – set the number of points at which to evaluate smoother
span – controls the amount of smoothing for the default loess smoother
level – level of confidence interval to use
method.args – List of additional arguments passed on to the modeling function defined by method
5. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))
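If group = 1 is not included, all bars in the plot have the same height, 1. By default geom_bar() groups the data by the x values, and the stat computes proportions within each group, so every cut makes up 100% of its own group. Setting group = 1 puts all observations into a single group, so the proportions are relative to the whole dataset.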
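The grouping can be checked numerically from the computed layer data. The sketch below uses after_stat(prop), the newer spelling of ..prop.. (this assumes ggplot2 3.3.0 or later).

```r
library(ggplot2)

# group = 1 puts all cuts into one group, so each proportion is
# relative to the whole dataset.
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
p_prop

# The computed proportions across the bars add up to 1.
sum(layer_data(p_prop)$prop)
```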
1. What is the problem with this plot? How could you improve it?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
There should be 234 points for the 234 observations in the dataset. However, many of them overlap, which causes the “overplotting” problem. We can add a small amount of random noise to each point to avoid it.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point(position = "jitter")
2. What parameters to geom_jitter() control the amount of jittering?
width and height – Control the amount of horizontal and vertical jitter, respectively. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here. If omitted, defaults to 40% of the resolution of the data: this means the jitter values will occupy 80% of the implied bins. Categorical data is aligned on the integers, so a width or height of 0.5 will spread the data so it’s not possible to see the distinction between the categories.
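A small sketch of the two parameters; width = 0.3 and height = 0 are arbitrary values chosen so that only the x positions move.

```r
library(ggplot2)

# width jitters along x, height along y; height = 0 leaves hwy unchanged.
set.seed(1)
p_jit <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0)
p_jit
```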
3. Compare and contrast geom_jitter() with geom_count()
geom_jitter() adds a small amount of random variation to the location of each point, so that overlapping observations are spread apart. geom_count() leaves the locations unchanged; it counts the number of observations at each location and maps that count to the size of the point. Both are ways of handling overplotting.
4. What’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.
The default position adjustment for geom_boxplot is “dodge2”. This position adjustment does not change the vertical position of a geom but moves the geom horizontally to avoid overlapping.
ggplot(data = mpg, mapping = aes(x = drv, y = displ)) +
geom_boxplot(mapping = aes(color = class))
If we set the position to “identity”, the boxplots overlap.
ggplot(data = mpg, mapping = aes(x = drv, y = displ)) +
geom_boxplot(mapping = aes(color = class), position = "identity")
1. Turn a stacked bar chart into a pie chart using coord_polar().
bar33 <- ggplot(data = mpg, mapping = aes(x = class)) +
geom_bar(mapping = aes(fill = class),
width = 1,
show.legend = FALSE) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar33 + coord_polar()
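The code above places one bar per class on the x axis, so coord_polar() turns it into a Coxcomb-style chart. For a pie chart from a single stacked bar, one possible sketch maps the bar height to the angle with theta = "y"; using factor(1) as a dummy x value is an illustrative choice.

```r
library(ggplot2)

# A single stacked bar: one x position, segments filled by class.
# theta = "y" maps the bar height to the angle, turning the stack into a pie.
p_pie <- ggplot(data = mpg, mapping = aes(x = factor(1), fill = class)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL)
p_pie
```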
2. What does labs() do? Read the documentation.
labs() sets the axis, legend, and plot labels, including the title, subtitle, and caption. See ?labs.
3. What’s the difference between coord_quickmap() and coord_map()?
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_map()
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
coord_map() projects a portion of the earth, which is approximately spherical, onto a flat 2D plane. Map projections do not, in general, preserve straight lines, so this requires considerable computation. coord_quickmap() is a faster approximation that preserves straight lines.
4. What does the plot below tell you about the relationship between cty and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
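cty and hwy are positively correlated, and the points generally lie above the reference line, so highway mileage tends to be higher than city mileage. coord_fixed() forces a specified ratio between the physical representation of data units on the axes. The default, ratio = 1, ensures that one unit on the x-axis is the same length as one unit on the y-axis, so the reference line is drawn at 45 degrees. geom_abline() adds a reference line defined by a slope and an intercept; with the defaults (slope 1, intercept 0) it is the line y = x. Horizontal and vertical reference lines are drawn with geom_hline() and geom_vline().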
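The relationship can also be checked numerically. This sketch computes the correlation and the share of cars above the y = x line, and writes the geom_abline() defaults out explicitly.

```r
library(ggplot2)

# Correlation between city and highway mileage.
cor(mpg$cty, mpg$hwy)

# Share of cars whose highway mileage exceeds their city mileage,
# i.e. the share of points above the y = x line.
mean(mpg$hwy > mpg$cty)

# The same reference line with its default slope and intercept spelled out.
p_ref <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  coord_fixed()
p_ref
```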
1. Why does this code not work?
my_variable <- 10
# my_varıable
#> Error in eval(expr, envir, enclos): object 'my_varıable' not found
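The two names are not the same: the second one is spelled with a dotless ı instead of i. R therefore looks for an object called my_varıable, which was never created, rather than my_variable, which holds the value 10.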
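The difference between the two names can be made visible by comparing them as strings. In this sketch the dotless ı is written as the escape "\u0131" so that the example does not depend on the file encoding; typed_name is a hypothetical helper variable.

```r
my_variable <- 10

# "\u0131" is the dotless i (U+0131).
typed_name <- "my_var\u0131able"

# The two names differ in one character, so they are not identical.
identical(typed_name, "my_variable")

# Only the correctly spelled object exists.
exists("my_variable")
exists(typed_name)
```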
2. Tweak each of the following R commands so that they run correctly:
ggplot(data = mpg) + # dota -> data
geom_point(mapping = aes(x = displ, y = hwy))
filter(mpg, cyl == 8) # fliter -> filter, = -> ==
## # A tibble: 70 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a6 quattro 4.2 2008 8 auto… 4 16 23 p mids…
## 2 chevrolet c1500 sub… 5.3 2008 8 auto… r 14 20 r suv
## 3 chevrolet c1500 sub… 5.3 2008 8 auto… r 11 15 e suv
## 4 chevrolet c1500 sub… 5.3 2008 8 auto… r 14 20 r suv
## 5 chevrolet c1500 sub… 5.7 1999 8 auto… r 13 17 r suv
## 6 chevrolet c1500 sub… 6 2008 8 auto… r 12 17 r suv
## 7 chevrolet corvette 5.7 1999 8 manu… r 16 26 p 2sea…
## 8 chevrolet corvette 5.7 1999 8 auto… r 15 23 p 2sea…
## 9 chevrolet corvette 6.2 2008 8 manu… r 16 26 p 2sea…
## 10 chevrolet corvette 6.2 2008 8 auto… r 15 25 p 2sea…
## # … with 60 more rows
filter(diamonds, carat > 3) # diamond -> diamonds
## # A tibble: 32 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 3.01 Premium I I1 62.7 58 8040 9.1 8.97 5.67
## 2 3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98
## 3 3.01 Premium F I1 62.2 56 9925 9.24 9.13 5.73
## 4 3.05 Premium E I1 60.9 58 10453 9.26 9.25 5.66
## 5 3.02 Fair I I1 65.2 56 10577 9.11 9.02 5.91
## 6 3.01 Fair H I1 56.1 62 10761 9.54 9.38 5.31
## 7 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38
## 8 3.24 Premium H I1 62.1 58 12300 9.44 9.4 5.85
## 9 3.22 Ideal I I1 62.6 55 12545 9.49 9.42 5.92
## 10 3.5 Ideal H I1 62.8 57 12587 9.65 9.59 6.03
## # … with 22 more rows
3. Press Alt + Shift + K. What happens? How can you get to the same place using the menus?
It opens the keyboard shortcut quick reference. To get to the same place from the menus, choose Tools > Keyboard Shortcuts Help.